Lab 7
library(tidyverse)
library(mvtnorm)
library(plotly)
library(MASS)
library(ggplot2)
Multivariate Normal
1. Simulation of bivariate normal distribution
- First, let's check the correctness of the formula we derived in class.
N= 10000
mymux = 1
mysigx = .5
mymuy = 2
mysigy = 1.5
myrho = .5 #positive moderate correlation
Z1 = rnorm(N)
Z2 = rnorm(N)
x <- mymux + mysigx*Z1
y <- mymuy + mysigy*(myrho*Z1 + sqrt(1-myrho^2)*Z2)
mean(y) # should be ~ 2
## [1] 1.985657
sd(y) # should be ~ 1.5
## [1] 1.502036
cor(x,y) # should be ~ .5
## [1] 0.4877044
plot(x,y,pch = 46) # positive moderate correlation
We see that the mean and standard deviation of y and the correlation coefficient are all close to their theoretical values.
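As a further check, the full empirical covariance matrix can be compared with the theoretical one. A minimal sketch, reusing the same parameter values as above:

```r
# Sketch: compare empirical and theoretical covariance of (x, y)
set.seed(1)
N <- 1e5
mux <- 1; sigx <- 0.5
muy <- 2; sigy <- 1.5
rho <- 0.5
z1 <- rnorm(N); z2 <- rnorm(N)
x <- mux + sigx * z1
y <- muy + sigy * (rho * z1 + sqrt(1 - rho^2) * z2)
Sigma_theory <- matrix(c(sigx^2,            rho * sigx * sigy,
                         rho * sigx * sigy, sigy^2), 2)
round(cov(cbind(x, y)), 3)  # should be close to Sigma_theory
Sigma_theory
```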
- Write a function that does the simulation using the formula that we found in class (checked above)
# Simulate bivariate normal distribution
mybivar = function(size = 1, mu1=0, mu2 = 0, sig1 = 1, sig2 = 1, rho = 0){
mydf = data.frame(x = rep(NA,size), y = rep(NA,size))
z1 = rnorm(size)
z2 = rnorm(size)
x = sig1*z1 + mu1
y = sig2*(rho*z1 + sqrt(1-rho^2)*z2) + mu2
return(data.frame(x = x, y = y))
}
- Try it out and plot the result.
N = 5000
mymux = 1
mymuy = 2
mysigx = .5
mysigy = 1
myrho = -.8 # negative high correlation
mydf = mybivar(N, mymux, mymuy, mysigx, mysigy, myrho)
head(mydf)
##           x         y
## 1 0.8526927 2.2440769
## 2 0.3472761 3.0290779
## 3 0.6698738 2.5912914
## 4 2.0943417 0.8984809
## 5 0.9016436 2.0544028
## 6 0.8070506 2.7382341
plot(mydf, cex = .1) # negative high correlation
# or
plot(mydf, pch = 46)
grid(col = 3)
Try different values of myrho to see how the shape of the scatter plot changes with the correlation coefficient.
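To see the effect directly, we can draw the same simulation at several values of rho in one figure. A small sketch using the standardized form of the formula (zero means, unit variances):

```r
# Sketch: scatter plots of a standard bivariate normal for several rho values
set.seed(2)
N <- 2000
z1 <- rnorm(N); z2 <- rnorm(N)
par(mfrow = c(2, 2))
for (rho in c(-0.9, -0.3, 0.3, 0.9)) {
  y <- rho * z1 + sqrt(1 - rho^2) * z2  # same construction as mybivar
  plot(z1, y, pch = 46, main = paste("rho =", rho))
}
par(mfrow = c(1, 1))
```

The cloud tilts from a downward to an upward ellipse as rho moves from negative to positive, and tightens toward a line as |rho| approaches 1.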
- Let’s try the 3d plot
dens <- kde2d(mydf$y,mydf$x)
plot_ly(x = dens$x,
y = dens$y,
z = dens$z) %>% add_surface()
# z is an n[1] by n[2] matrix of the estimated density: rows correspond to the value of x, columns to the value of y.
- 3D scatter plot
# threejs Javascript plot
library(threejs)
# Unpack data from kde grid format
x <- dens$x; y <- dens$y; z <- dens$z
# Construct x,y,z coordinates
xx <- rep(x,times=length(y))
yy <- rep(y,each=length(x))
zz <- z; dim(zz) <- NULL
# Set up color range
ra <- ceiling(16 * zz/max(zz))
col <- rainbow(16, 2/3)
# 3D interactive scatter plot
scatterplot3js(x=xx, y=yy, z=zz, size=0.4, color=col[ra], bg="black")
2. Multivariate normal
Check more on the “mvtnorm” package: https://www.rdocumentation.org/packages/mvtnorm/versions/1.1-3
#install.packages("scatterplot3d") # Install
library("scatterplot3d") # load
library(mvtnorm) # multivariate normal
# 3d plot
set.seed(123)
N <- 2000 # Number of random samples
# Target parameters for univariate normal distributions
rho <- -0.9
mu1 <- 1; s1 <- 2 #sigmax
mu2 <- 1; s2 <- 2 #sigmay
# Parameters for bivariate normal distribution
mu <- c(mu1,mu2) # Mean
sigma <- matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2),2) # Covariance matrix
#or
#sigma = diag(2)
XY = rmvnorm(N, mu, sigma)
fxy = dmvnorm(XY, mu, sigma)
x = as.matrix(XY[,1])
y = as.matrix(XY[,2])
z = as.matrix(fxy)
scatterplot3d(x, y, z, main='rho=-0.9')
3. Conditional expectations from bivariate normal distribution
Make an approximate sample from \(X|Y = 1\).
Since Y is a continuous random variable, we won't be able to find values where Y is exactly equal to 1. Therefore, we throw away all cases where \(Y\) is not close to 1; that is, we keep only the samples where \(.95 < Y < 1.05\).
y0 = 1
keep = mydf$y < y0 + .05 & mydf$y > y0 - .05 #because continous r.v
x.y = mydf$x[keep] #conditional X|Y=1
hist(x.y)
fig <- plot_ly(x = ~x.y, type = "histogram") %>%
  layout(title = 'Distribution of the Conditional Variable X|Y=1')
fig
df <- data.frame(y = x.y)
p <- ggplot(df, aes(sample = y))
p + stat_qq() + stat_qq_line() +
  ggtitle("Q-Q plot of the Conditional variable")
# qqnorm(x.y)
mean(x.y)
## [1] 1.420377
var(x.y)
## [1] 0.07489792
It is clear that the conditional distribution also follows a normal distribution.
- Theoretical mean and variance:
# conditional mean and variance formulas from class
mymux + myrho*mysigx/mysigy*(y0 - mymuy)
## [1] 1.4
(1-myrho^2)*mysigx^2
## [1] 0.09
- Scatterplot and theoretical regression line
We must plot x against y.
The red line shows the function \(E(X|Y=y)\)
plot(mydf$y,mydf$x, pch = 46) #joint distribution of x and y
abline(a = mymux - myrho*mysigx/mysigy*mymuy, b = myrho*mysigx/mysigy, col = 2, lwd = 3)
Multinomial distribution
Recall: The multinomial distribution is a generalization of the binomial distribution to k categories instead of just binary (success/fail). For n independent trials each of which leads to a success for exactly one of k categories, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories.
Example: Rolling a die N times
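A minimal sketch of the die-rolling example (a fair die, so all six probabilities are 1/6):

```r
# Roll a fair die N = 60 times; the face counts follow a multinomial distribution
set.seed(42)
N <- 60
counts <- rmultinom(n = 1, size = N, prob = rep(1/6, 6))
counts           # six face counts, summing to 60
# Probability of observing exactly 10 of each face:
dmultinom(x = rep(10, 6), prob = rep(1/6, 6))
```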
Multinomial Examples
1. Chess Game Prediction
Two chess players have the probability Player A would win is 0.40, Player B would win is 0.35, game would end in a draw is 0.25.
The multinomial distribution can be used to answer questions such as: “If these two chess players played 12 games, what is the probability that Player A would win 7 games, Player B would win 2 games, the remaining 3 games would be drawn?”
# 12 trials, 3 categories (A wins, B wins, draw)
dmultinom(x=c(7,2,3), prob = c(0.4,0.35,0.25))
## [1] 0.02483712
2. Opinion Polls on Election
In a little town, 40% of the eligible voters prefer candidate A, 10% prefer candidate B, 50% have no preference.
You randomly sample 10 eligible voters. What is the probability that 4 will prefer candidate A, 1 will prefer candidate B, 5 will have no preference ?
dmultinom(x=c(4,1,5), prob = c(0.4,0.1,0.5))
## [1] 0.1008
2. Generate Multinomial Random Variables
Generate multinomial random variables with varying probabilities: given a matrix of multinomial probabilities where rows correspond to observations and columns to categories (and each row sums to 1), generate a matrix with the same number of rows as my_prob and with m columns. The columns represent multinomial cell counts, and within a row, the columns are all samples from the same multinomial distribution.
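The description above mentions a probability matrix with one row per observation, while the code below reuses a single probability vector for every draw. A minimal sketch of the row-wise case (the matrix `prob_mat` and its values are made up for illustration):

```r
# One multinomial draw per row of prob_mat, using that row's own probabilities
set.seed(7)
prob_mat <- rbind(c(0.2, 0.3, 0.5),
                  c(0.6, 0.3, 0.1))   # each row sums to 1
size <- 10
draws <- t(apply(prob_mat, 1, function(p) rmultinom(1, size = size, prob = p)))
draws            # one row of category counts per observation
rowSums(draws)   # each row sums to size
```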
# rmultinom(n, size, prob)
my_prob <- c(0.2,0.3,0.1,0.4)
number_of_experiments <- 20 # trials
(a). Number of Samples = 10
number_of_samples <- 10
experiments <- rmultinom(n=number_of_experiments, size=number_of_samples, prob=my_prob)
experiments
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 0 1 3 3 2 3 3 1 1 1 1 1 2 1
## [2,] 2 5 1 2 2 3 0 3 4 3 3 5 3 3
## [3,] 0 1 2 0 2 1 1 0 0 1 2 1 0 0
## [4,] 8 3 4 5 4 3 6 6 5 5 4 3 5 6
## [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] 1 1 2 1 3 3
## [2,] 3 2 2 4 0 2
## [3,] 0 1 1 3 0 0
## [4,] 6 6 5 2 7 5
df=data.frame(experiments)
df
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20
## 1 0 1 3 3 2 3 3 1 1 1 1 1 2 1 1 1 2 1 3 3
## 2 2 5 1 2 2 3 0 3 4 3 3 5 3 3 3 2 2 4 0 2
## 3 0 1 2 0 2 1 1 0 0 1 2 1 0 0 0 1 1 3 0 0
## 4 8 3 4 5 4 3 6 6 5 5 4 3 5 6 6 6 5 2 7 5
colSums(df)
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20
## 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
dff= t(df)
colnames(dff) <- c("x1","x2","x3","x4")
rownames(dff) <- NULL
dff
## x1 x2 x3 x4
## [1,] 0 2 0 8
## [2,] 1 5 1 3
## [3,] 3 1 2 4
## [4,] 3 2 0 5
## [5,] 2 2 2 4
## [6,] 3 3 1 3
## [7,] 3 0 1 6
## [8,] 1 3 0 6
## [9,] 1 4 0 5
## [10,] 1 3 1 5
## [11,] 1 3 2 4
## [12,] 1 5 1 3
## [13,] 2 3 0 5
## [14,] 1 3 0 6
## [15,] 1 3 0 6
## [16,] 1 2 1 6
## [17,] 2 2 1 5
## [18,] 1 4 3 2
## [19,] 3 0 0 7
## [20,] 3 2 0 5
rowSums(dff)
## [1] 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
(b). plot the marginals
(dfM = data.frame(experiments)/number_of_samples) # gives probabilities
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19
## 1 0.0 0.1 0.3 0.3 0.2 0.3 0.3 0.1 0.1 0.1 0.1 0.1 0.2 0.1 0.1 0.1 0.2 0.1 0.3
## 2 0.2 0.5 0.1 0.2 0.2 0.3 0.0 0.3 0.4 0.3 0.3 0.5 0.3 0.3 0.3 0.2 0.2 0.4 0.0
## 3 0.0 0.1 0.2 0.0 0.2 0.1 0.1 0.0 0.0 0.1 0.2 0.1 0.0 0.0 0.0 0.1 0.1 0.3 0.0
## 4 0.8 0.3 0.4 0.5 0.4 0.3 0.6 0.6 0.5 0.5 0.4 0.3 0.5 0.6 0.6 0.6 0.5 0.2 0.7
## X20
## 1 0.3
## 2 0.2
## 3 0.0
## 4 0.5
dffM= t(dfM)
colnames(dffM) <- c("x1","x2","x3","x4")
rownames(dffM) <- NULL
dffM
## x1 x2 x3 x4
## [1,] 0.0 0.2 0.0 0.8
## [2,] 0.1 0.5 0.1 0.3
## [3,] 0.3 0.1 0.2 0.4
## [4,] 0.3 0.2 0.0 0.5
## [5,] 0.2 0.2 0.2 0.4
## [6,] 0.3 0.3 0.1 0.3
## [7,] 0.3 0.0 0.1 0.6
## [8,] 0.1 0.3 0.0 0.6
## [9,] 0.1 0.4 0.0 0.5
## [10,] 0.1 0.3 0.1 0.5
## [11,] 0.1 0.3 0.2 0.4
## [12,] 0.1 0.5 0.1 0.3
## [13,] 0.2 0.3 0.0 0.5
## [14,] 0.1 0.3 0.0 0.6
## [15,] 0.1 0.3 0.0 0.6
## [16,] 0.1 0.2 0.1 0.6
## [17,] 0.2 0.2 0.1 0.5
## [18,] 0.1 0.4 0.3 0.2
## [19,] 0.3 0.0 0.0 0.7
## [20,] 0.3 0.2 0.0 0.5
rowSums(dffM)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
par(mfrow=c(2,2))
for(i in 1:4) {
barplot(dffM[,i])
}
3. Topic Modeling (LDA)
Example: In data mining, when we discuss everything in terms of text classification, i.e. topic modeling:
Each document has its own distribution over topics, and each topic has its own distribution over the words: the (multinomial) distribution over words for a particular topic, and the (multinomial) distribution over topics for a particular document.
Latent Dirichlet allocation (LDA) is one of the most common algorithms for topic modeling.
Every document is a mixture of topics. We imagine that each document may contain words from several topics in particular proportions. For example, in a two-topic model we could say “Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B.”
Every topic is a mixture of words. For example, we could imagine a two-topic model of American news, with one topic for “politics” and one for “entertainment.” The most common words in the politics topic might be “President”, “Congress”, and “government”, while the entertainment topic may be made up of words such as “movies”, “television”, and “actor”. Importantly, words can be shared between topics; a word like “budget” might appear in both equally.
LDA is a mathematical method for estimating both of these at the same time: finding the mixture of words that is associated with each topic, while also determining the mixture of topics that describes each document.
https://www.tidytextmining.com/topicmodeling.html
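Before fitting a model, it may help to see how the two multinomial pieces fit together. A toy generative sketch, where the vocabulary, the topic-word matrix `beta`, and the document-topic weights `gamma` are all made-up illustrative values, not estimates from any fitted model:

```r
set.seed(1)
vocab <- c("president", "congress", "movie", "actor", "budget")
# Topic-word probabilities: each row is a multinomial over the vocabulary
beta <- rbind(politics      = c(0.4, 0.3, 0.0, 0.0, 0.3),
              entertainment = c(0.0, 0.0, 0.4, 0.4, 0.2))
# Document-topic proportions for one document: 70% politics, 30% entertainment
gamma <- c(0.7, 0.3)
# The document's word distribution is the gamma-weighted mixture of topic rows
doc_word_probs <- as.vector(gamma %*% beta)
# Draw the document's word counts from that multinomial
word_counts <- rmultinom(1, size = 100, prob = doc_word_probs)
rownames(word_counts) <- vocab
word_counts
```

Note how "budget" gets probability mass from both topics, matching the shared-word example above.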
library(topicmodels)
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
StatCorpus <- Corpus(DirSource("SMCorpus"))
(summary(StatCorpus))
## Length Class Mode
## T1.txt 2 PlainTextDocument list
## T2.txt 2 PlainTextDocument list
## T3.txt 2 PlainTextDocument list
## T4.txt 2 PlainTextDocument list
## T5.txt 2 PlainTextDocument list
## T6.txt 2 PlainTextDocument list
MyStopwords<-c("form","have","that","must","very","find","provides")
Stat_dtm <- DocumentTermMatrix(StatCorpus,
control = list(
#stopwords = TRUE, ## remove normal stopwords
wordLengths=c(4, 10),
## get rid of words of len 4 or smaller or larger than 10
removePunctuation = TRUE,
removeNumbers = TRUE,
tolower=TRUE,
#stemming = TRUE,
remove_separators = TRUE,
stopwords = MyStopwords
# removeWords (custom functions in this control list are ignored; see warning below)
#bounds = list(global = c(minTermFreq, maxTermFreq))
))
## Warning in TermDocumentMatrix.SimpleCorpus(x, control): custom functions are
## ignored
DTM_mat <- as.matrix(Stat_dtm)
DTM_mat[-1,]
## Terms
## Docs analysis available carry concepts data deep important order scientists
## T2.txt 1 0 0 0 1 0 0 0 0
## T3.txt 1 0 0 0 1 0 0 0 0
## T4.txt 0 0 0 0 1 0 0 0 0
## T5.txt 0 0 0 0 1 0 0 0 0
## T6.txt 0 0 0 0 0 0 0 0 0
## Terms
## Docs statistics analyze draw from gather review studies collection concerns
## T2.txt 2 1 1 1 1 1 1 0 0
## T3.txt 1 0 0 0 0 0 0 1 1
## T4.txt 0 0 0 0 0 0 0 0 0
## T5.txt 1 0 0 0 0 0 0 0 0
## T6.txt 0 0 0 1 0 0 0 0 0
## Terms
## Docs discipline access computer focuses learn learning machine programs
## T2.txt 0 0 0 0 0 0 0 0
## T3.txt 1 0 0 0 0 0 0 0
## T4.txt 0 1 1 1 1 1 1 1
## T5.txt 0 0 0 0 0 1 1 0
## T6.txt 0 0 0 0 1 1 1 0
## Terms
## Docs themselves algorithms amounts massive patterns ability artificial
## T2.txt 0 0 0 0 0 0 0
## T3.txt 0 0 0 0 0 0 0
## T4.txt 1 0 0 0 0 0 0
## T5.txt 0 1 1 1 1 0 0
## T6.txt 0 0 0 0 0 1 1
## Terms
## Docs being experience explicitly improve programmed systems without
## T2.txt 0 0 0 0 0 0 0
## T3.txt 0 0 0 0 0 0 0
## T4.txt 0 0 0 0 0 0 0
## T5.txt 0 0 0 0 0 0 0
## T6.txt 1 1 1 1 1 1 1
x<- DTM_mat[-1,]
ap_lda <- LDA(x , k = 2, control = list(seed = 1234))
#ap_lda
The tidytext package provides a method for extracting the per-topic-per-word probabilities, called "beta", from the model.
library(tidytext)
ap_topics <- tidy(ap_lda, matrix = "beta")
#ap_topics
Notice that this has turned the model into a one-topic-per-term-per-row format. For each combination, the model computes the probability of that term being generated from that topic. For example, the term "data" has a 0.066 probability of being generated from topic 1, but a 0.172 probability of being generated from topic 2.
We could use dplyr’s top_n() to find the 10 terms that are most common within each topic. As a tidy data frame, this lends itself well to a ggplot2 visualization.
library(ggplot2)
library(dplyr)
# %>% is the pipe operator; it replaces nested function calls
ap_top_terms <- ap_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
ap_top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()
Besides estimating each topic as a mixture of words, LDA also models each document as a mixture of topics. We can examine the per-document-per-topic probabilities, called "gamma", with the matrix = "gamma" argument to tidy().
ap_documents <- tidy(ap_lda, matrix = "gamma")
ap_documents
## # A tibble: 10 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 T2.txt 1 0.0105
## 2 T3.txt 1 0.0173
## 3 T4.txt 1 0.305
## 4 T5.txt 1 0.987
## 5 T6.txt 1 0.992
## 6 T2.txt 2 0.989
## 7 T3.txt 2 0.983
## 8 T4.txt 2 0.695
## 9 T5.txt 2 0.0131
## 10 T6.txt 2 0.00815
Each of these values is an estimated proportion of words from that document that are generated from that topic. For example, the model estimates that about 99% of the words in document T2.txt were generated from topic 2.